Analysis

Natural Resource Policy

Importing libraries

library(dplyr)
library(readxl)
library(tidygeocoder)
library(sf)
library(mapview)
library(RColorBrewer)
library(plotly)

Importing data

data <- read_excel("geo_NCdata.xlsx")

Natural Resource data

nr_data <- select(data, c("City", "Availability of Water", "Agricultural Potential",
                       "Mining Potential", "Tourism Potential", "Environmental Sensitivity",
                       "latitude", "longitude"))
head(nr_data)
## # A tibble: 6 x 8
##   City     `Availability of~ `Agricultural Po~ `Mining Potenti~ `Tourism Potent~
##   <chr>                <dbl>             <dbl>            <dbl>            <dbl>
## 1 Aggeney~              19.2              26.4             70.3             16.2
## 2 Alexand~              34.4              20.1             66.4             54.0
## 3 Askham,~              19.0              32.7             34.3             54.4
## 4 Augrabi~              34.3              45.6             30.7             54.2
## 5 Barkly ~              42.0              56.1             76.5             41.2
## 6 Brandvl~              17.0              23.1             28.1             17.7
## # ... with 3 more variables: Environmental Sensitivity <dbl>, latitude <dbl>,
## #   longitude <dbl>

Distribution Analysis

fig <- nr_data %>%
  plot_ly(
    y = ~`Availability of Water`,
    type = "violin",
    box = list(visible = TRUE),
    meanline = list(visible = TRUE),
    x0 = "Availability of Water")
fig <- fig %>%
  layout(
    title = "Distribution of Availability of Water",
    yaxis = list(title = "%", zeroline = FALSE))

fig

Cities/Towns that are not geocoded

nr_data[rowSums(is.na(nr_data)) > 0,]$City
## [1] "Delpoortshoop, Northern Cape"    "Olynvenhoutsdrif, Northern Cape"
## [3] "Phillipstown, Northern Cape"     "Soverby, Northern Cape"

Removing Cities that are not geocoded

locations_nr <- subset(nr_data, !is.na(nr_data$longitude) & !is.na(nr_data$latitude))
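
The check above flags rows with a missing value in any column, while the filter tests only the two coordinate columns. A minimal sketch on a made-up data frame (the `toy` values are hypothetical and not taken from `geo_NCdata.xlsx`) shows how the two can differ:

```r
# Hypothetical toy data: row 2 lacks coordinates, row 3 lacks an indicator value
toy <- data.frame(
  City      = c("A", "B", "C"),
  Water     = c(10.5, 20.1, NA),
  latitude  = c(-28.7, NA, -29.1),
  longitude = c(24.7, NA, 21.9),
  stringsAsFactors = FALSE
)

# Rows with a missing value in any column (same idea as the rowSums check)
any_missing <- toy[rowSums(is.na(toy)) > 0, ]$City

# Rows kept when filtering on the coordinates only
geocoded <- subset(toy, !is.na(longitude) & !is.na(latitude))

any_missing    # cities with at least one NA
geocoded$City  # cities that still have coordinates
```

In this toy case city C passes the coordinate filter but still carries a missing indicator value, which `kmeans()` would reject. If the real data can contain such rows, filtering with `complete.cases()` on the clustering columns may be the safer choice.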

K-means Cluster Analysis

Clustering the data

Standardizing data

  • Standardizing (scaling) data to remove variations due to different measurement scales
locations_nr_scale <- scale(select(locations_nr,
                                   c("Availability of Water", "Agricultural Potential",
                       "Mining Potential", "Tourism Potential", "Environmental Sensitivity")))

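As a small illustration of what `scale()` does, the sketch below standardizes two made-up indicators measured on very different scales (the `toy` data frame is hypothetical). After scaling, every column has mean 0 and standard deviation 1, so no single indicator dominates the distance calculations used by k-means.

```r
# Hypothetical indicators on very different measurement scales
toy <- data.frame(
  water  = c(10, 20, 30, 40),
  mining = c(1000, 3000, 2000, 6000)
)

# Subtract each column mean, then divide by each column standard deviation
toy_scaled <- scale(toy)

colMeans(toy_scaled)          # approximately 0 for every column
apply(toy_scaled, 2, sd)      # 1 for every column

# The original means and standard deviations are kept as attributes
attr(toy_scaled, "scaled:center")
attr(toy_scaled, "scaled:scale")
```
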
Assessing Clustering Tendency (ACT)

  • ACT evaluates whether the data set contains meaningful clusters, i.e. whether cluster analysis is feasible at all
  • Method: Statistical (Hopkins statistic)
  • The Hopkins statistic assesses the clustering tendency of a data set by measuring the probability that the data set was generated by a uniform distribution. In other words, it tests the spatial randomness of the data.
  • A Hopkins statistic (H) value of about 0.5 means that the data are uniformly distributed
  • Null hypothesis: the data set D is uniformly distributed (i.e., contains no meaningful clusters)
  • Alternative hypothesis: the data set D is not uniformly distributed (i.e., contains meaningful clusters)
  • If H is close to zero (well below 0.5), we can reject the null hypothesis and conclude that the data set D is significantly clusterable. Note that this reading depends on the implementation: some packages report 1 - H instead, in which case values close to 1 indicate clusterability.
#hopkins(locations_nr_scale, n = nrow(locations_nr_scale)-1)
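
To make the idea concrete, here is a simplified base-R sketch of a Hopkins-type statistic under the convention described above (values near 0 suggest clusters, values near 0.5 suggest uniform data). The function `hopkins_sketch()` is a hypothetical helper written for illustration; it is not the `hopkins()` function from the commented-out call, and it uses plain Euclidean distances without the power-of-dimension variant found in some definitions.

```r
# Hypothetical helper: simplified Hopkins-type statistic, H = sum(w) / (sum(u) + sum(w))
hopkins_sketch <- function(X, m = max(1, floor(nrow(X) / 10))) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # w: distance from m sampled real points to their nearest other real point
  idx <- sample(n, m)
  D <- as.matrix(dist(X))
  diag(D) <- Inf
  w <- apply(D[idx, , drop = FALSE], 1, min)

  # u: distance from m uniform random points (within the data range)
  #    to their nearest real point
  U <- matrix(sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j]))),
              nrow = m)
  u <- apply(U, 1, function(p) min(sqrt(rowSums(sweep(X, 2, p)^2))))

  sum(w) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight, well-separated groups versus points spread uniformly
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.05), ncol = 2),
                   matrix(rnorm(100, mean = 5, sd = 0.05), ncol = 2))
uniform <- matrix(runif(200), ncol = 2)

H_clustered <- hopkins_sketch(clustered)  # expected to be close to 0
H_uniform   <- hopkins_sketch(uniform)    # expected to be in the region of 0.5
```

With clustered data the real points sit very close to each other (small `w`) while random points fall far from them (large `u`), which pushes the ratio toward zero.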

Estimating the optimal number of clusters

  • Methods: Elbow method (within sum of square) and Silhouette method
  • library: factoextra
library(factoextra)
fviz_nbclust(locations_nr_scale, kmeans, method = "wss")

fviz_nbclust(locations_nr_scale, kmeans, method =  "silhouette")
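
The elbow method can also be sketched by hand: fit k-means for a range of k and record the total within-cluster sum of squares. The example below uses R's built-in `iris` measurements as stand-in data (an assumption for illustration, not the Northern Cape data) and only approximates what `fviz_nbclust(..., method = "wss")` plots.

```r
# Stand-in data: the four numeric iris columns, standardized
x <- scale(iris[, 1:4])

set.seed(123)
ks <- 1:6

# Total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

data.frame(k = ks, wss = round(wss, 1))

# The "elbow" is the k after which adding clusters reduces wss only slightly
plot(ks, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the value equals the total sum of squares of the scaled data, so the curve always starts there and decreases as k grows; the choice of k is a judgment about where the decrease levels off.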

K-means Clustering

  • Seven clusters are used to group the observations, as suggested by the estimation methods above
set.seed(123)
locations_nr_cluster <- kmeans(locations_nr_scale, 
                               centers = 7, nstart = 25)
library(ggplot2)
library(plotly)

Cluster Visual Assessment

  • Observations are represented by points in the plot, using principal components if ncol(data) > 2.
  • PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible.
ggplotly(fviz_cluster(locations_nr_cluster, data = locations_nr_scale) +
           theme_minimal() +
           theme(legend.position = "none") +
           ggtitle("Natural Resource Clusters (Groups)"))
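
The projection behind such a plot can be approximated with `prcomp()`. The sketch below again uses the built-in `iris` measurements as stand-in data (an assumption for illustration) and is not meant as a description of the internals of `fviz_cluster()`.

```r
# Stand-in data: four standardized numeric columns
x <- scale(iris[, 1:4])

set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25)

# Principal component analysis on the already standardized data
pca <- prcomp(x)

# Coordinates of each observation on the first two principal components
scores <- pca$x[, 1:2]

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

# Scatter plot of the observations in PC space, coloured by cluster
plot(scores, col = km$cluster, pch = 19,
     main = "Clusters on the first two principal components")
```

The share of variance explained by the first two components indicates how faithfully the two-dimensional plot represents the distances in the full indicator space.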

Adding the clusters to the Natural Resource Data Frame

locations_nr$Cluster <- as.factor(locations_nr_cluster$cluster)
head(locations_nr)
## # A tibble: 6 x 9
##   City     `Availability of~ `Agricultural Po~ `Mining Potenti~ `Tourism Potent~
##   <chr>                <dbl>             <dbl>            <dbl>            <dbl>
## 1 Aggeney~              19.2              26.4             70.3             16.2
## 2 Alexand~              34.4              20.1             66.4             54.0
## 3 Askham,~              19.0              32.7             34.3             54.4
## 4 Augrabi~              34.3              45.6             30.7             54.2
## 5 Barkly ~              42.0              56.1             76.5             41.2
## 6 Brandvl~              17.0              23.1             28.1             17.7
## # ... with 4 more variables: Environmental Sensitivity <dbl>, latitude <dbl>,
## #   longitude <dbl>, Cluster <fct>

Cluster Mean

  • Creating a Natural Resource Cluster data frame
nr_clust <- select(locations_nr, c("Availability of Water", "Agricultural Potential",
                       "Mining Potential", "Tourism Potential", "Environmental Sensitivity"))
  • Computing the cluster mean of each natural resource indicator
  • This shows how natural resources vary by group
  • The cluster centers help in evaluating how distinct the clusters are, and therefore whether the cluster analysis produced meaningful groups
nr_clust_table <- aggregate(nr_clust,
                            by=list(cluster= locations_nr_cluster$cluster),
                            mean)
nr_clust_table
##   cluster Availability of Water Agricultural Potential Mining Potential
## 1       1              24.58120               28.87901         52.73583
## 2       2              39.71313               57.21372         68.73526
## 3       3              34.99344               38.45660         35.36125
## 4       4              24.66475               34.42678         27.20825
## 5       5              33.87909               42.32195         75.05238
## 6       6              35.43283               46.15791         38.63769
## 7       7              20.51299               26.72645         25.11277
##   Tourism Potential Environmental Sensitivity
## 1          41.46333                  31.25629
## 2          25.83522                  37.47207
## 3          22.59556                  42.73840
## 4          28.99601                  63.86612
## 5          15.95009                  58.56706
## 6          53.02939                  37.06212
## 7          53.50389                  42.81456
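
The relationship between these per-cluster means and the k-means centers can be checked on stand-in data. The sketch below uses the built-in `iris` measurements (an assumption for illustration): the centers returned by `kmeans()` are in standardized units, and undoing the scaling recovers the means that `aggregate()` computes on the original values.

```r
# Stand-in data: raw and standardized versions of the same columns
raw <- iris[, 1:4]
sc  <- scale(raw)

set.seed(123)
km <- kmeans(sc, centers = 3, nstart = 25)

# Cluster means on the original measurement scale
agg <- aggregate(raw, by = list(cluster = km$cluster), mean)

# Back-transform the standardized centers: multiply by sd, then add the mean
back <- sweep(km$centers, 2, attr(sc, "scaled:scale"), "*")
back <- sweep(back, 2, attr(sc, "scaled:center"), "+")

agg
round(back, 5)
```

Reporting the means on the original scale, as in the table above, keeps the values interpretable as percentages, while the clustering itself was done on standardized values.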

Natural Resource Clusters

# locations_nr %>%
#       group_by(Cluster) %>%
#       summarise(n = n()) %>%
#       arrange(n) %>%
#       mutate(Cluster = factor(Cluster, levels = unique(Cluster))) %>%
#       plot_ly(x = ~n, y = ~Cluster, type = "bar") %>%
#       layout(title = "Natural Resource Grouping", yaxis = list(title = "Cluster"),
#              xaxis = list(title = "Number of Cities/Towns"))

ggplotly(locations_nr %>%
      group_by(Cluster) %>%
      summarise(No_of_Cities = n()) %>%
      arrange(No_of_Cities) %>%
      mutate(Cluster = factor(Cluster, levels = unique(Cluster))) %>%
      ggplot(aes(x = Cluster, y = No_of_Cities)) +
      geom_bar(stat = "identity",
               fill = "#1f77b4") +
      geom_text(aes(label = No_of_Cities),
                vjust = -0.25) +
      coord_flip() +
      labs(x = "Cluster", 
           y = "Number of Cities/Towns",
           title = "Natural Resource Grouping (Clusters)") +
      theme_minimal())

Viewing the clusters on a map (mapview)

  • SF object of cluster data for Natural Resources
Natural_Resource <- st_as_sf(locations_nr, coords = c("longitude", "latitude"), crs = 4326)
mapview(Natural_Resource, 
        zcol = "Cluster")